
[pull] master from ggml-org:master#77

Merged
pull[bot] merged 12 commits into CrazyForks:master from ggml-org:master on May 19, 2026



@pull pull Bot commented May 19, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


rgerganov and others added 12 commits May 19, 2026 09:42
With the introduction of MTP we can have multiple compute contexts for
the same RPC device. In this case last_graph_uid is not updated properly
when contexts are being switched. This patch fixes this by moving
last_graph_uid to the device context, making sure it is always updated.

closes: #23242
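
The effect of moving `last_graph_uid` from the compute context to the device context can be pictured with a toy model. This is a sketch only, written in Python for brevity rather than in the backend's own language; the `Device` and `Context` classes, the `compute` method, and the "send"/"reuse" decision are hypothetical stand-ins, and the assumption that the uid decides whether a graph is resent is not stated in the commit message.

```python
class Device:
    """Hypothetical stand-in for one RPC device."""

    def __init__(self):
        # Shared by every context on this device, so it is always current.
        self.last_graph_uid = None


class Context:
    """Hypothetical stand-in for one compute context bound to a device."""

    def __init__(self, device):
        self.device = device

    def compute(self, graph_uid):
        # Assumption: a graph is resent only when the device last saw a different one.
        if self.device.last_graph_uid == graph_uid:
            return "reuse"
        self.device.last_graph_uid = graph_uid
        return "send"


dev = Device()
ctx_a, ctx_b = Context(dev), Context(dev)

# Switching from ctx_a to ctx_b and back must not leave ctx_a with a stale uid.
log = [ctx_a.compute(1), ctx_b.compute(2), ctx_a.compute(1), ctx_a.compute(1)]
```

If each `Context` kept its own copy of the uid instead, the third call in this model would wrongly report "reuse" even though the device last saw graph 2.
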
* sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle

Signed-off-by: Chun Tao <chun.tao@intel.com>

* Use async mem ops for correctness when SYCL graphs are explicitly on.

Signed-off-by: Tao, Chun <chun.tao@intel.com>

---------

Signed-off-by: Chun Tao <chun.tao@intel.com>
Signed-off-by: Tao, Chun <chun.tao@intel.com>
Co-authored-by: Chun Tao <chun.tao@intel.com>
* add chapter for performance reference

* rm unsupported GPU
* llama-eval : add per-problem summary table to HTML reports

- Add chunk_idx and problem_idx to TaskState and saved case dicts
- Group completed cases by problem_idx in dump_html()
- Render per-problem summary table before individual task table
  - Columns: Problem (zero-padded), Runs, Correct (n/r),
    Tokens (min/avg/max), T/s (min/avg/max), Gen s (min/avg/max)
  - Sorted by problem index, monospace font, right-aligned numbers
  - Colspan headers for grouped stats, auto width
- Simulator: add /v1/models endpoint, timings in response,
  template-aware question matching, --dataset arg (aime/aime2025)

Assisted-by: llama.cpp:local pi
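
The grouping step described above could look roughly like the following. This is a minimal sketch, not the code from `dump_html()`: the function name `summarize_by_problem`, the `correct` and `tokens` keys, and the three-digit padding width are assumptions made for illustration; only `problem_idx` and the column meanings come from the commit message.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_problem(cases):
    """Group completed cases by problem_idx and build one summary row per problem."""
    groups = defaultdict(list)
    for case in cases:
        groups[case["problem_idx"]].append(case)

    rows = []
    for idx in sorted(groups):  # rows are sorted by problem index
        runs = groups[idx]
        tokens = [c["tokens"] for c in runs]
        n_correct = sum(1 for c in runs if c["correct"])
        rows.append({
            "problem": f"{idx:03d}",                # zero-padded (width is an assumption)
            "runs": len(runs),
            "correct": f"{n_correct}/{len(runs)}",  # n/r
            "tokens": (min(tokens), round(mean(tokens)), max(tokens)),
        })
    return rows


cases = [
    {"problem_idx": 1, "correct": True, "tokens": 120},
    {"problem_idx": 0, "correct": False, "tokens": 300},
    {"problem_idx": 1, "correct": False, "tokens": 80},
]
rows = summarize_by_problem(cases)
```

The T/s and Gen s columns would follow the same min/avg/max pattern as the tokens column.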

* llama-eval : add tabs for Detailed and Summary tables, apply monospace font globally

- Wrap Detailed and Summary tables in switchable tabs (Detailed active by default)
- Remove summary-section wrapper, use tab labels instead
- Apply monospace font to all tables and the top bar

Assisted-by: llama.cpp:local pi

* llama-eval : redesign top bar as CSS grid label/value pairs

- Replace flat span list with 4-column grid layout (2 pairs per row)
- Labels in muted color (#888), values in dark (#222)
- Bold dataset name and model name
- Removed media query, always uses 4 columns

Assisted-by: llama.cpp:local pi

* llama-eval : use realistic token counts and throughput in simulator

- comp_tokens: [30, 80] → [10000, 60000]
- tps_gen: derived → uniform [90.0, 110.0]
- t_gen_ms: now computed from tokens/tps

Assisted-by: llama.cpp:local pi
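
A minimal sketch of how these three values might relate, assuming the token count is drawn as a uniform integer and the generation time is reported in milliseconds. The function name `simulate_timings` and the returned dict layout are hypothetical; the ranges and the tokens/tps relationship are taken from the commit message.

```python
import random


def simulate_timings(rng):
    """Draw a token count and throughput, then derive the generation time from them."""
    comp_tokens = rng.randint(10000, 60000)     # completion tokens, inclusive range
    tps_gen = rng.uniform(90.0, 110.0)          # generation speed in tokens per second
    t_gen_ms = comp_tokens / tps_gen * 1000.0   # time = tokens / tps, converted to ms
    return {"comp_tokens": comp_tokens, "tps_gen": tps_gen, "t_gen_ms": t_gen_ms}


timings = simulate_timings(random.Random(0))
```

Deriving `t_gen_ms` from the other two values keeps the simulated numbers consistent with each other, which independent sampling of all three would not guarantee.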

* llama-eval : color Answer column green/red based on correctness

Use the same .correct/.incorrect CSS classes on the Answer column
to make correct answers green and incorrect answers red.

Assisted-by: llama.cpp:local pi

* llama-eval : fix pyright errors from max(..., key=len) type inference

Use key=lambda x: len(x) instead of key=len so the type checker
infers the return type as str instead of Sized, fixing:
  - unresolved-attribute: Object of type Sized has no attribute lower
  - not-subscriptable: Cannot subscript object of type Sized

Assisted-by: llama.cpp:local pi
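
The pattern in question, shown on made-up data. The `candidates` list and variable names are illustrative only; the runtime behavior of `key=len` and `key=lambda x: len(x)` is the same, and the difference lies only in what the type checker infers, as the commit message describes.

```python
candidates = ["B", "Paris", "paris, france"]

# With key=len the checker reportedly infers Sized for the result;
# the explicit lambda lets it keep the element type, str.
longest = max(candidates, key=lambda x: len(x))

normalized = longest.lower()  # needs str, not Sized
first_char = longest[0]       # subscripting also needs str
```
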
* save-load-state : refactor into separate phase functions

- Split monolithic main() into 4 self-contained phase functions, each
  managing its own context/sampler/batch lifecycle
- Each function tokenizes internally using its local ctx instance
- main() is now a clean orchestrator: init -> run phases -> assert results
- Proper resource cleanup on every exit path (return {} on error)

Assisted-by: llama.cpp:local pi

* save-load-state : use params.out_file instead of separate state_file

- Remove state_file parameter from all phase functions
- Each function accesses params.out_file directly
- Initialize params.out_file in main alongside params.prompt

Assisted-by: llama.cpp:local pi

* save-load-state : use smart pointers for ctx and smpl

- Replace raw llama_context* with llama_context_ptr
- Replace raw llama_sampler* with llama_sampler_ptr
- Remove all manual llama_free() and llama_sampler_free() calls
- Keep llama_batch as raw (managed manually with llama_batch_free)

Assisted-by: llama.cpp:local pi

* save-load-state : add local llama_batch_ptr RAII wrapper

- Add llama_batch_ptr struct holding llama_batch by value
- Calls llama_batch_free() in destructor
- Eliminates all manual llama_batch_free() calls

Assisted-by: llama.cpp:local pi

* save-load-state : replace printf/fprintf with logging macros

- Add log.h include
- Replace fprintf(stderr, ...) errors with LOG_ERR
- Replace fprintf(stderr, ...) info with LOG_TRC
- Replace printf output with LOG

Assisted-by: llama.cpp:local pi

* save-load-state : refactor tests to check results inline

Each follow-up phase now accepts an expected result and performs
the comparison internally instead of collecting results in main().

Assisted-by: llama.cpp:local pi

* save-load-state : improve test output readability

Add phase labels, remove redundant run prefixes, and show
PASS after each test.

Assisted-by: llama.cpp:local pi

* pi : add rule about git signing

* save-load-state : simplify llama_batch_ptr

Change get() to return a reference and remove operator*().
Use batch.get() throughout for consistency.

Assisted-by: llama.cpp:local pi

* save-load-state : extract generate_tokens helper

Factor out the repeated token generation loop into a shared
helper function used by all phases.

Assisted-by: llama.cpp:local pi

* save-load-state : update comments to use test terminology

Replace "Phase" with "Test" and list each test's steps
as bullet points.

Assisted-by: llama.cpp:local pi

* save-load-state : rename test functions

Rename to test_baseline, test_state_load, test_seq_cp_host,
test_seq_cp_device. Update comments and logs accordingly.

Assisted-by: llama.cpp:local pi

* pi : add rule to never git push without confirmation

Assisted-by: llama.cpp:local pi

* common : add model_only option to common_init_from_params

Add bool model_only parameter to skip context creation,
sampler init, and context-dependent setup.

Use in save-load-state to initialize only the model,
with each test creating its own context.

Assisted-by: llama.cpp:local pi

---------

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
Add graphs reused counter to the per-slot timing output, printed via
llama_perf_context().

Assisted-by: llama.cpp:local pi

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
* chore: Update vulnerable packages

* chore: Formatting

* refactor: Update Tailwind CSS imports

* ci: Use `ubuntu-latest` for Unit/E2E UI tests

* chore: Bump package

* fix: Add missing tag

* refactor: Enums files naming
@pull pull Bot locked and limited conversation to collaborators May 19, 2026
@pull pull Bot added the ⤵️ pull label May 19, 2026
@pull pull Bot merged commit 6db1304 into CrazyForks:master May 19, 2026

8 participants